Abstract: Web mining applications need to be improved with algorithms specific to document
classification. This paper emphasizes the importance of using appropriate measures and methods for evaluating
Web document classification performance. We focus on methods that evaluate how well a classifier performs. The
effects of transformations of the confusion matrix are considered for eleven well-known and recently introduced
classification measures. We analyze each measure's ability to retain its value under changes in a confusion matrix. We
discuss the benefits of using invariant and non-invariant measures with respect to the characteristics of the data
classes.

1. Introduction



1.1 Web Document Classification 

Web document classification means assigning
a document to a labeled, predefined category. In the
context of information technologies applied in
industry, business and economics in general,
document classification is the process of
establishing a technical standard among competing
entities in a market, where it brings benefits
without hurting competition. It can also be viewed as a
mechanism for optimising economic activity.
Document classification tasks are divided into two
categories: supervised document classification,
where an external mechanism (such as human
feedback) provides information on the correct
classification of the documents, and unsupervised
document classification, where the classification must be
made without reference to external information.
The workflow of the document classification process
is illustrated in Figure 1.

Fig. 1. Flow chart of the processing steps in the
document classification process [18].



The document classification process includes the
following phases:

• Preprocessing: transform documents into a
representation suitable for the classification task (e.g.,
remove HTML or other tags, remove stopwords,
perform word stemming, i.e., remove suffixes);

• Indexing by different weighting schemes:
Boolean weighting, word frequency weighting, tf*idf
weighting, ltc weighting, Entropy weighting, etc.;

• Feature selection: remove non-informative
terms from documents in order to improve classification
effectiveness and reduce computational complexity;

• Classification algorithms: Rocchio's
algorithm, k-Nearest-Neighbor algorithm (KNN),
Decision Tree algorithm (DT), Naive Bayes algorithm
(NB), Artificial Neural Network algorithm (ANN),
Support Vector Machine algorithm (SVM), Voting
algorithm, etc.;

• Performance of the algorithm: training time, testing
time, classification accuracy (precision, recall,
F-score, micro-average/macro-average, etc.). The goal of
this phase is high classification quality and
computational efficiency.
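
As an illustration of the indexing phase, the following minimal sketch computes tf*idf weights for tokenized documents. It assumes one common variant of the scheme (raw term count multiplied by log(N/df)); the function name tf_idf and the toy documents are hypothetical and not taken from the cited systems.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = tf * log(N / df), with raw counts as tf.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus (hypothetical), already preprocessed into tokens.
docs = [
    ["web", "mining", "web"],
    ["web", "document"],
    ["document", "classification"],
]
w = tf_idf(docs)
print(w[0])
```

A term that occurs in fewer documents ("mining") receives a larger idf factor than one that occurs in many ("web"), which is the property the weighting is meant to capture.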

A classifier is a mapping from a (discrete or
continuous) feature space X to a discrete set of labels
Y. A framework for the classifier within the Web document
classification process is presented in Figure 2.

Fig. 2. A framework of a classifier in two phases [19].



1.2 Techniques of the Web Document 
Classification 

The architecture of an operational classification system
for automatic document classification is illustrated
in Figure 3.



Figure 3. Operational system architecture of the
document classification process [19].

For both task groups there are various document
classification techniques, which are continually being
improved. The main categories of classification techniques are:
Naive Bayes classifier, TF-IDF (Term Frequency -
Inverse Document Frequency), Latent Semantic
Indexing (LSI), Support Vector Machine (SVM),
Artificial Neural Network (ANN), k-Nearest Neighbor
(kNN), Concept Mining, and approaches based on
natural language processing.



1.3 Web Mining and Web Document 
Classification 

Web Mining is the extraction of interesting and
potentially useful patterns and implicit information
from artifacts or activity related to the World Wide
Web. There are roughly three knowledge discovery
domains that pertain to web mining: Web Content
Mining, Web Structure Mining, and Web Usage
Mining. In Figure 4 we illustrate the taxonomy of web
mining [1].

Web content mining is the process of extracting
knowledge from the content of documents or their
descriptions. Web document text mining, resource
discovery based on concept indexing, or agent-based
technology may also fall in this category. Web
structure mining is the process of inferring knowledge 
from the World Wide Web organization and links 
between references and referents in the Web. Finally, 
web usage mining, also known as Web Log Mining, is 
the process of extracting interesting patterns in web 
access logs. 




Fig. 4. Web Mining Taxonomy. 



2 A Review of the Evaluation Metrics in
Classification

Most evaluation metrics in classification are designed
to reward class uniformity in the example subsets
induced by a feature (e.g., Information Gain). Other
metrics are designed to reward discrimination power
in the context of feature selection as a means to
combat the feature-interaction problem (e.g., Relief,
Contextual Merit).

2.1 Purity-based Metrics 

An evaluation metric M quantifies the quality of the
partitions induced by a feature over a set of training
examples T. Purity-based or traditional metrics define
M by measuring the amount of class uniformity gained
by decomposing T into the set of example subsets
$\{T_m\}$ induced by feature $X_k$. Let P be the vector of class
probabilities estimated from the data in the complete
set T, let $P_m$ be the corresponding vector of class
probabilities estimated from the data in the induced
subset $T_m$, and let I be a measure of the impurity of a
class probability vector. M is typically defined as
follows [7]:

$M(X_k) = I(P) - \sum_m \frac{|T_m|}{|T|} I(P_m)$    (1)



Different variations of M can be obtained by
changing the impurity function I. For example, for
Information Gain [8], impurity is defined in terms of
entropy:

$I_{entropy}(P) = -\sum_i p_i \log_2 p_i$    (2)

Another example is the Gini Index [9]:

$I_{gini}(P) = 1 - \sum_i p_i^2$    (3)
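
The purity-based scheme can be sketched in a few lines. The example below assumes the usual weighted form of the metric, in which the impurity of each induced subset is weighted by its relative size, and uses the standard entropy and Gini impurity functions; the helper names (class_probs, purity_gain) and the toy data are hypothetical.

```python
import math
from collections import Counter


def class_probs(labels):
    """Estimate the class probability vector from a list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(p):
    """Entropy impurity of a class probability vector."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def gini(p):
    """Gini impurity of a class probability vector."""
    return 1.0 - sum(pi * pi for pi in p)


def purity_gain(feature_values, labels, impurity):
    """Impurity of T minus the size-weighted impurity of the subsets
    induced by the feature (assumed weighting: |T_m| / |T|)."""
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    n = len(labels)
    remainder = sum(
        len(s) / n * impurity(class_probs(s)) for s in subsets.values()
    )
    return impurity(class_probs(labels)) - remainder


labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]   # splits the classes exactly
useless = ["a", "b", "a", "b"]   # carries no class information

print(purity_gain(perfect, labels, entropy))  # information gain
print(purity_gain(useless, labels, entropy))
print(purity_gain(perfect, labels, gini))
```

Swapping the impurity argument is all that is needed to move between Information Gain and a Gini-based metric, which mirrors the way the text obtains variations of M.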

Equation (1) covers most traditional metrics, but there
are two major limitations:

• first is a tendency to favor features with many
values. Inducing many example subsets increases
the probability of finding class-uniform subsets,
but at the expense of overfitting. Several solutions
have already been proposed for this problem [3];

• second is the inability to detect the relevance of a 
feature when its contribution to the target concept 
is hidden by combinations with other features, also 
known as the feature-interaction problem [10]. 

To attack the feature-interaction problem additional 
information besides class probabilities is required. 

2.2 Discrimination-based Metrics
A different kind of evaluation metric considers the
discrimination power of each feature, i.e., the ability of
a feature to separate examples of different classes. Let
$X_i$ and $X_j$ be two examples lying close to each other
according to some distance measure D. Feature $X_k$ is
awarded some amount of discrimination power if it
takes on different values when the class values of $X_i$
and $X_j$ differ, i.e., when $x_i^k \neq x_j^k$ and
$C(X_i) \neq C(X_j)$. The more often this condition is
true for pairs of nearby examples, the higher the
quality of feature $X_k$.

Two representative examples of discrimination-based
metrics are Contextual Merit and Relief [10]. Before
describing them, we define the distance between two
examples as follows:



$D(X_i, X_j) = \sum_k d(x_i^k, x_j^k)$    (4)

For nominal features, $d(x_i^k, x_j^k)$ is defined as

$d(x_i^k, x_j^k) = 0$ if $x_i^k = x_j^k$, and $1$ otherwise    (5)

For numeric features, $d(x_i^k, x_j^k)$ is defined as

$d(x_i^k, x_j^k) = \frac{|x_i^k - x_j^k|}{TH}$    (6)

where TH is a normalization factor, e.g.,
$MAX(X_k) - MIN(X_k)$ (the difference between the
maximum and minimum values observed for feature
$X_k$ in T).

Different metrics are obtained by varying the update
function. The Relief algorithm, for example,
updates the score $q_k$ as follows:

$q_k = q_k + d(x_i^k, x_j^k)$ if $C(X_i) \neq C(X_j)$
$q_k = q_k - d(x_i^k, x_j^k)$ if $C(X_i) = C(X_j)$    (7)



Thus Relief updates $q_k$ whenever the feature values of
two neighboring examples differ; the score increases if
their class values differ and decreases if they are the
same. Contextual Merit updates $q_k$ only when both feature
values and class values differ; it uses the update
function:

$q_k = q_k + \frac{d(x_i^k, x_j^k)}{D(X_i, X_j)^2}$ if $C(X_i) \neq C(X_j)$    (8)

The score of a feature thus decreases quadratically with the
distance between two examples [11].
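
The two update rules can be compared on a toy data set. The sketch below is a simplification under stated assumptions: it iterates over all pairs of examples rather than only nearby ones, uses a 0/1 distance for nominal features and a range-normalized absolute difference for numeric ones, and assumes that a Contextual Merit contribution is the per-feature distance divided by the squared total distance, which matches the quadratic decrease described above. The function names and data are hypothetical.

```python
from itertools import combinations


def d_nominal(a, b):
    """Assumed 0/1 distance for nominal feature values."""
    return 0.0 if a == b else 1.0


def d_numeric(a, b, span):
    """Absolute difference normalized by span (e.g. max - min)."""
    return abs(a - b) / span


def feature_scores(examples, labels, spans, rule):
    """Score every feature with the Relief-style or Contextual-Merit-style
    update. spans[k] is None for a nominal feature, else its normalizer."""
    if rule not in ("relief", "contextual_merit"):
        raise ValueError("unknown rule: " + rule)
    n_features = len(examples[0])
    q = [0.0] * n_features
    # Simplification: all pairs are visited, not only neighboring ones.
    for i, j in combinations(range(len(examples)), 2):
        d = [
            d_nominal(a, b) if s is None else d_numeric(a, b, s)
            for a, b, s in zip(examples[i], examples[j], spans)
        ]
        total = sum(d)  # distance between the two examples
        differ = labels[i] != labels[j]
        for k in range(n_features):
            if rule == "relief":
                q[k] += d[k] if differ else -d[k]
            elif differ and total > 0:
                # Assumed form: contribution shrinks with squared distance.
                q[k] += d[k] / total ** 2
    return q


# Feature 0 (nominal) determines the class; feature 1 (numeric) is noise.
examples = [("a", 0.0), ("a", 1.0), ("b", 0.0), ("b", 1.0)]
labels = [0, 0, 1, 1]
spans = [None, 1.0]

relief = feature_scores(examples, labels, spans, "relief")
merit = feature_scores(examples, labels, spans, "contextual_merit")
print(relief, merit)
```

Under both rules the informative feature receives the higher score, while the noisy feature is penalized by Relief whenever it differs between examples of the same class.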



3. Performance Evaluation of the Web 
Document Classification 

The performance of a classifier can be measured or
estimated in a number of different ways. Which
method to use is still subject to research and depends
on the type of classification and the type of data to be
classified.



3.1 Web Document Classification and 
Performance Measures 

Quality of classification can be assessed using a
confusion matrix, i.e., records of correctly and
incorrectly recognized examples for each class. Table
1 reports on binary classification [12].

                       Predicted Class
                       Class=Yes   Class=No
Actual   Class=Yes     tp          fn
Class    Class=No      fp          tn

Table 1. A confusion matrix for binary classification.
The confusion matrix includes: tp = true positives, fn =
false negatives, fp = false positives, tn = true negatives.
The retrieval of relevant documents, or a positive 
class, is the most important task, thus focus is on tp 
classification. Importance of retrieval of positive 
examples is reflected by the choice of performance 
measures for text classification: 

$Precision = \frac{tp}{tp + fp}$    (9)

$Recall = \frac{tp}{tp + fn}$    (10)

$Fscore = \frac{(\beta^2 + 1)\,tp}{(\beta^2 + 1)\,tp + \beta^2 fn + fp}$    (11)

$BreakEvenPoint = \frac{tp}{tp + fp} = \frac{tp}{tp + fn}$    (12)

Three measures evaluate the classifier performance by
calculating the ratio of correctly classified positive
examples to the examples labeled as positive (Precision),
to the positive examples in the data (Recall), and to the
total positive examples, both labeled and from the data (Fscore).
BreakEvenPoint essentially estimates the point at which
disagreement between data and algorithm labeling of
positive examples is balanced (fp = fn). All these
measures omit tn from their formulas and thus do not
consider the correct classification of negative examples.
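
A minimal sketch of these measures follows, taking the four confusion-matrix counts in the order (tp, fn, tn, fp). The counts are hypothetical and chosen only to show that tn plays no role in the result.

```python
def precision(tp, fn, tn, fp):
    """Share of examples labeled positive that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn, tn, fp):
    """Share of truly positive examples that were labeled positive."""
    return tp / (tp + fn)


def f_score(tp, fn, tn, fp, beta=1.0):
    """Fscore with weight beta; beta = 1 weights both errors equally."""
    b2 = beta * beta
    return (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp)


# Hypothetical counts: 50 relevant documents among 1000.
counts = (40, 10, 930, 20)        # tp, fn, tn, fp
fewer_negatives = (40, 10, 30, 20)  # same matrix with a different tn

p = precision(*counts)
r = recall(*counts)
f1 = f_score(*counts)
print(p, r, f1)
print(f_score(*fewer_negatives) == f1)  # tn does not enter the formula
```

With beta = 1 the Fscore coincides with the harmonic mean of Precision and Recall, which the test values confirm for these counts.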
In [14], Lee et al. note that the retrieval of a positive
class, discrimination between classes, and balance between
retrieval of both classes are possible tasks whose
importance depends on the problem at hand. So far,
there is no common understanding on the choice of
measures used to evaluate the performance of classifiers
in Web document classification. Employed performance measures
are either

$Accuracy = \frac{tp + tn}{tp + fn + fp + tn}$    (13)

which is used in [14] and other works by this group, or
Precision, Recall, Fscore, or the correspondence between

$Sensitivity = \frac{tp}{tp + fn} = Recall$    (14)

and

$Specificity = \frac{tn}{fp + tn}$    (15)

reported in [16]. With different measures in use, it is
important to know how performance evaluations
produced by those measures relate to each other.



3.2 Invariance Properties of the Measures 

Finding an appropriate measure is possible by
establishing how comparable the involved
measures are. Following [17], we focus on the ability of a
measure to preserve its value under a change in a
confusion matrix. The invariance of a measure signals
that it does not detect this change. Depending on the
learning goals, non-detection can be beneficial or
adverse.

For instance, text classification extensively uses 
Precision and Recall {Sensitivity). These measures do 
not detect changes in tn, when all other matrix entries 
remain the same. In document classification, a large 
number of unrelated documents constitutes a negative 
class that lacks unifying characteristics (a multimodal 
negative class). The criterion for the performance of 
the classifier is its performance on related documents 
(a well-defined, unimodal, positive class) and may not
depend on tn. Precision and Recall depend on tp, 
which shows agreement between data and algorithm 
labeling of positive examples, and fp and fn, which 
show disagreement between data and algorithm 
labeling of positive examples. Thus these measures 
provide the most important perspective on classifiers' 
performance for document classification. Another 
emerging application of text classification, 
classification of consumer reviews, works with highly
related documents constituting unimodal positive and 
negative classes. Thus the evaluation measure may 
depend on classification of negative examples and 
reflect the tn change, when other matrix elements stay 
the same. 

We examine the invariance properties with respect to
basic changes of a matrix. Our claim is that the
following invariance properties affect the measure's
applicability and trustworthiness:
Exchange of tp with tn and fn with fp (t1). Table 2
shows the confusion matrix after the changes to the
confusion matrix reported in Table 1. A measure is
invariant if

$m(tp; fn; tn; fp) = m(tn; fp; tp; fn)$    (16)

This shows measure permanence with respect to the
distribution of classification results. If the measure is
invariant, then it does not distinguish tp from tn and fn
from fp and may not recognize asymmetry of
classification results. Thus it may not be trustworthy
when classifiers are compared on data sets with
different and/or unbalanced class distributions. For
example, invariant measures may be more appropriate
for assessment of classification of consumer reviews
than for document classification.
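
The exchange property can be probed numerically. The sketch below swaps tp with tn and fn with fp on one hypothetical matrix and compares the values of two standard measures before and after; a single matrix can demonstrate non-invariance, whereas invariance in general follows from the symmetry of the formula.

```python
def accuracy(tp, fn, tn, fp):
    """Share of all examples that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def precision(tp, fn, tn, fp):
    """Share of examples labeled positive that are truly positive."""
    return tp / (tp + fp)


def exchange(tp, fn, tn, fp):
    """Swap tp with tn and fn with fp, keeping the argument order."""
    return tn, fp, tp, fn


counts = (40, 10, 930, 20)   # hypothetical tp, fn, tn, fp
swapped = exchange(*counts)  # (930, 20, 40, 10)

accuracy_unchanged = accuracy(*counts) == accuracy(*swapped)
precision_unchanged = precision(*counts) == precision(*swapped)
print(accuracy_unchanged, precision_unchanged)
```

Accuracy treats both classes symmetrically and so keeps its value, while Precision changes because it distinguishes the positive class from the negative one.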
Change of true negative count (t2). Table 3 presents the
resulting confusion matrix. A measure is invariant if

$m(tp; fn; tn; fp) = m(tp; fn; tn'; fp)$    (17)

Such a measure does not recognize the specifying ability of
classifiers. Such evaluation may be more applicable to
domains with a multi-modal negative class, built as
"everything not positive".





                       Predicted Class
                       Class=Yes   Class=No
Actual   Class=Yes     tn          fp
Class    Class=No      fn          tp

Table 2. Confusion matrix after the exchange of tp
with tn and fn with fp.

                       Predicted Class
                       Class=Yes   Class=No
Actual   Class=Yes     tp          fn
Class    Class=No      fp          tn'

Table 3. Confusion matrix after a change in true
negative count.

If the measure is non-invariant under t2, then it
acknowledges the ability of classifiers to correctly identify
negative examples. If the measure is able to do this, it
may be reliable for comparison in domains with a
well-defined, unimodal negative class. In the case of text
classification, the invariant measures are suitable for
evaluation of document classification, and
non-invariant measures are preferable for evaluation of
such communications where criteria exist for
positive as well as for negative results.
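
A short numerical check illustrates the difference. The counts below are hypothetical; only tn is changed between the two matrices, and the measures are the standard ones defined earlier in this section.

```python
def recall(tp, fn, tn, fp):
    """Share of truly positive examples that were labeled positive."""
    return tp / (tp + fn)


def precision(tp, fn, tn, fp):
    """Share of examples labeled positive that are truly positive."""
    return tp / (tp + fp)


def specificity(tp, fn, tn, fp):
    """Share of truly negative examples that were labeled negative."""
    return tn / (fp + tn)


def accuracy(tp, fn, tn, fp):
    """Share of all examples that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


MEASURES = {
    "recall": recall,
    "precision": precision,
    "specificity": specificity,
    "accuracy": accuracy,
}

before = (40, 10, 930, 20)  # hypothetical tp, fn, tn, fp
after = (40, 10, 30, 20)    # same matrix, smaller tn

unchanged = {
    name: m(*before) == m(*after) for name, m in MEASURES.items()
}
print(unchanged)
```

Recall and Precision ignore the change, which suits a loosely defined negative class; Specificity and Accuracy register it, which is what a well-defined negative class calls for.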
Change of a false count (t3). Table 4 reports the
confusion matrix. A measure is invariant if

$m(tp; fn; tn; fp) = m(tp; fn; tn; fp')$    (18)

t3 indicates measure constancy when disagreement
increases between the data and classifier labels. An
invariant measure shows preference for data labels. In
case of unreliable data labeling such a measure may give




                       Predicted Class
                       Class=Yes   Class=No
Actual   Class=Yes     tp          fn
Class    Class=No      fp'         tn

Table 4. Confusion matrix after a change in false
positive count.

A non-invariant measure may not be suitable for data
with many counter examples. If classifier ranking
improves when fp increases, the measure may favor a
classifier prone to false positives.

In the case of t3, the choice between invariant and
non-invariant measures might be decided based on problem
and data characteristics.
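
The effect of a growing false positive count can be checked on hypothetical numbers. In the sketch below only fp changes between the two matrices; the helper change_fp is illustrative.

```python
def precision(tp, fn, tn, fp):
    """Share of examples labeled positive that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn, tn, fp):
    """Share of truly positive examples that were labeled positive."""
    return tp / (tp + fn)


def change_fp(tp, fn, tn, fp, new_fp):
    """Return the same matrix with a different false positive count."""
    return tp, fn, tn, new_fp


before = (40, 10, 930, 20)  # hypothetical tp, fn, tn, fp
after = change_fp(*before, new_fp=60)

recall_unchanged = recall(*before) == recall(*after)
precision_drop = precision(*before) - precision(*after)
print(recall_unchanged, precision_drop)
```

Recall does not react to the extra false positives, while Precision falls, so Precision penalizes rather than favors a classifier prone to false positives.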

Classification scaling (t4). Table 5 presents the
confusion matrix. A measure is invariant if

$m(tp; fn; tn; fp) = m(k_1 tp; k_1 fn; k_2 tn; k_2 fp)$    (19)

                       Predicted Class
                       Class=Yes   Class=No
Actual   Class=Yes     k1 tp       k1 fn
Class    Class=No      k2 fp       k2 tn

Table 5. Confusion matrix after scaling.
This shows measure uniformity with respect to
proportional changes of classification results. If the
measure is non-invariant, then its applicability may
depend on class sizes. If we expect that for different
data sizes the same portion of examples exhibits
positive (negative) characteristics, then the invariant
measure may be a better choice for classifier
evaluation. The non-invariant measures may be more
reliable if we do not know how representative the
data sample is in terms of the proportion of positive/negative
examples (which might be the case for web-posted
consumer reviews).
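
Scaling can be examined with exact arithmetic. The sketch below assumes that the positive row of the matrix is multiplied by k1 and the negative row by k2, as in Table 5, and uses Python's fractions module so that equality tests are exact; the counts and factors are hypothetical.

```python
from fractions import Fraction


def recall(tp, fn, tn, fp):
    return Fraction(tp, tp + fn)


def specificity(tp, fn, tn, fp):
    return Fraction(tn, fp + tn)


def precision(tp, fn, tn, fp):
    return Fraction(tp, tp + fp)


def accuracy(tp, fn, tn, fp):
    return Fraction(tp + tn, tp + fn + tn + fp)


def scale(tp, fn, tn, fp, k1, k2):
    """Scale the positive class by k1 and the negative class by k2."""
    return k1 * tp, k1 * fn, k2 * tn, k2 * fp


MEASURES = {
    "recall": recall,
    "specificity": specificity,
    "precision": precision,
    "accuracy": accuracy,
}

base = (40, 10, 930, 20)           # hypothetical tp, fn, tn, fp
scaled = scale(*base, k1=3, k2=1)  # three times as many positives

unchanged = {
    name: m(*base) == m(*scaled) for name, m in MEASURES.items()
}
print(unchanged)
```

Recall and Specificity each depend on one class only, so the class factor cancels; Precision and Accuracy mix counts from both classes and therefore shift when the class proportions change.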



4 Conclusion 

Much work has been done in the research of classifier 
performance evaluation, comparison and classifier 
performance optimization, though the conclusion that 
can be drawn after conducting the literature survey is 
that most articles only focus on one optimization 
technique or one learning algorithm. Furthermore 
there are often discussions in the literature about 
which learning algorithm to use given a specific class 
of problem. 

We have analyzed the applicability of performance
measures to different subfields of text classification.
We have shown that document classification differs
from classification of human communications, and thus
that these two types of text classification may require
different sets of performance measures. We have shown
that the results of classifier comparison depend on
a number of factors, including the invariance properties of
the measures. We have considered the effects of various
transformations of the confusion matrix on several
well-known performance measures. The invariance
properties have led to fine distinctions in the relations
between the measures and the data characteristics. One
way to ensure reliable evaluation is to employ a
measure corresponding to the learning setting. The
next step would be to expand the list of connections
between learning settings and evaluation measures.
This approach opens new directions for future work.
First, we built a framework for the two-dimensional
relations "measure vs invariance" and omitted decision
theory relations. Note that the listed measures evaluate
different decision aspects of the classifier
performance.